chinaXiv:202302.00222v1
Abstract: Foxtail millet ear detection and counting are essential for estimating foxtail millet production and for breeding. However, traditional ear counting approaches based on manual statistics are time-consuming and labor-intensive. To count foxtail millet ears accurately and efficiently, a foxtail millet ear detection method with adaptive anchor box adjustment was proposed in this research. First, an ear detection dataset was established, containing 784 images and 10,000 ear samples. Then, a novel detection approach based on YOLOv4 (You Only Look Once) was developed to quickly and accurately detect the ears of foxtail millet within a specific box. To verify the effectiveness of the proposed approach, several criteria were employed, including Precision, Recall, F1-score and mean Average Precision (mAP). Moreover, ablation studies were designed to validate the proposed method, including (1) comparing the performance of the proposed model with other models (YOLOv2, YOLOv3 and Faster-RCNN); (2) evaluating the model with different Intersection over Union (IOU) thresholds to find the optimal IOU threshold; (3) evaluating the detection with and without anchor box adjustment to verify the effectiveness of the adjustment; (4) analyzing the reasons for the changes in the model criteria; and (5) evaluating the detection with different original input image sizes. Experimental results showed that YOLOv4 obtained superior ear detection performance. Specifically, the mAP and F1-score of YOLOv4 reached 78.99% and 83.00%, respectively. The Precision was 87.00% and the Recall was 79.00%, which was about 8% better than the YOLOv2, YOLOv3 and Faster-RCNN models in terms of all criteria. Moreover, the experimental results indicate that the proposed method offers promising accuracy and faster speed.
1 Introduction


Effective foxtail millet breeding will increase food production and ensure food security. Thus, the estimation of foxtail millet production has become a research issue, since it plays an important role in breeding. Foxtail millet production is mainly determined by three factors, namely the number of ears, the number of grains per ear, and the quality of the grains. The contributions of these three factors to production are ranked as: grain number per ear > ear number > grain quality. Therefore, accurate estimation of the number of ears is of key importance to foxtail millet production. However, the traditional manual estimation approach is subjective and inefficient. Deep neural networks can be utilized to detect foxtail millet ears efficiently and accurately, and the detected boxes can be further employed to facilitate the estimation of foxtail millet production.


For wheat ear detection, in recent years, benefiting from the rapid development of deep learning and the great improvement in the performance of hardware devices, neural networks have received a lot of attention in the fields of target detection, semantic segmentation, and instance segmentation. Lu proposed a wheat ear recognition approach based on a back propagation (BP) neural network. In 2013, Shi first extracted the color, shape and texture parameters of wheat grains, and then utilized a BP neural network to classify the grains. Further, mean square error (MSE) and mean impact value (MIV)-BP were employed to optimize the BP neural network. Their experimental results showed that the recognition rate increased by 11.45% compared with models without optimization. Zhang et al. designed a winter wheat ear detection and counting system based on a convolutional neural network. Gao utilized YOLOv3 and Mask Region-Convolutional Neural Networks (Mask R-CNN) to directly detect wheat ears in the field and achieved an mAP of 87.12%. Alkhudaydi et al. developed a fully convolutional model to estimate wheat ears from high-resolution RGB images. Xie et al. developed a Feature Cascade SVM (FCS R-CNN) wheat ear detection method and obtained an mAP of 81.22%.

Although deep learning based detection methods have been applied to wheat ear detection and have achieved good results, few approaches have been developed for foxtail millet ear detection. Thus, in this research, foxtail millet ear detection was explored and an effective method was proposed.

Considering its promising detection capacity, YOLOv4 was employed to perform foxtail millet ear detection and to further facilitate foxtail millet ear counting. Furthermore, to make the YOLOv4 model effective for the specific foxtail millet ear detection task, the sizes of the anchor boxes in the model were adjusted via the K-means method based on the foxtail millet ear detection dataset. Through this adjustment, the performance of foxtail millet ear detection was enhanced. The foxtail millet ear detection dataset was collected from farmland and contains 784 images with about 10,000 foxtail millet ear samples in total. Among them, 588 images were utilized for training and the rest for testing.
2 Data set 


The data were collected from the foxtail millet experimental field at Shanxi Agricultural University in Taigu county, Jinzhong city, Shanxi province. The varieties included Male sterile line GBS, Datong 27 and Dragon Claw.
2.1 Data collection 


The data collection lasted one month, from August 10th to September 10th, 2020. To make the samples diverse and rich, the data were collected every other day at 10 a.m., and three kinds of foxtail millet samples, namely Male sterile line GBS, Datong 27 and Dragon Claw, were collected. Therefore, the collected data contain samples under different light and weather conditions.

A white box was placed over the foxtail millet ears in a specific area, and each image of this area was taken as one original data sample. Considering weight, deformation and convenience, the white box was made of PVC pipes, with a size of 0.5 m (width) x 0.6 m (length) x 0.5 m (height), and it was positioned 0.5 m from the ground. The camera was a Canon EOS 70D with a 35 mm focal length, since it could obtain high-resolution sample images, and the distance between the camera lens and the white frame was 1.5 to 2 m. The collected data were stored in *.jpg format with a resolution of 4864x3648 px. Fig. 1 shows several foxtail millet ear samples.


Fig. 1 Examples of foxtail millet ear: (a) sample 1; (b) sample 2; (c) sample 3


2.2 Data cleaning 


In order to obtain a well-trained model, images with blurry foxtail millet ears or excessive weeds were eliminated, to reduce the impact of the background and of image degradation on the detection accuracy of the network.

Effective models usually depend on accurate data annotation. To achieve good data annotation, labelImg was employed to label the dataset. Specifically, each foxtail millet ear inside the white box in an image was annotated with a rectangular box, which is represented by the coordinates of its four vertices. After all foxtail millet ears inside the white box in one image were labeled, a corresponding XML file was generated. In the XML file, all information was stored in the annotation tag, which contained the size of the image, the name of the label box, and the location of the target box. Subsequently, the XML file generated for each image was converted into a text file that served as the network input.
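The XML-to-text conversion step can be illustrated with a minimal sketch. The paper does not specify the text format, so this example assumes the common YOLO convention (one line per box: class index, then center x, center y, width and height normalized by the image size) and the Pascal VOC XML layout that labelImg writes by default. The function name, the class name `ear` and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_yolo_lines(xml_text, class_names=("ear",)):
    """Convert one Pascal VOC style annotation into YOLO-style text lines.

    Assumed output format (not stated in the paper):
    "<class_index> <cx> <cy> <w> <h>", all coordinates normalized to [0, 1].
    """
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in class_names:
            continue  # skip labels that are not part of the detection task
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Corner coordinates -> normalized center coordinates and size.
        cx = (xmin + xmax) / 2.0 / img_w
        cy = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        index = class_names.index(name)
        lines.append(f"{index} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical annotation for one ear in a 4864x3648 px image.
EXAMPLE_XML = """<annotation>
  <size><width>4864</width><height>3648</height><depth>3</depth></size>
  <object><name>ear</name>
    <bndbox><xmin>1000</xmin><ymin>500</ymin><xmax>1200</xmax><ymax>900</ymax></bndbox>
  </object>
</annotation>"""

for line in voc_xml_to_yolo_lines(EXAMPLE_XML):
    print(line)
```

Normalizing by the image size keeps the labels valid when the 4864x3648 px images are later resized to the network input size.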

Finally, the foxtail millet ear detection dataset contained 784 images and 10,000 foxtail millet ear samples in total. Among them, 588 images (75%) were adopted as the training set, and the remaining 196 images (25%) were adopted as the test set.


3 Methods and experiment 


3.1 The YOLO models 


YOLO is an excellent model for object detection, which balances detection speed and accuracy well. YOLO is a one-stage detection approach, which detects objects directly without generating candidate proposals.

YOLO performs foxtail millet ear detection through the following steps. First, features are extracted from the input foxtail millet ear image via the feature extraction network to obtain an NxN feature map. Then, the input image is divided into NxN grid cells. If the center coordinate of a ground-truth object falls into a grid cell, this grid cell is responsible for predicting the object. The grid cell predicts M bounding boxes with different sizes, and the bounding box with the largest Intersection over Union (IOU) is utilized to predict the object. Specifically, each bounding box contains 5 prediction values: $t_x$, $t_y$, $t_w$, $t_h$ and confidence. $t_x$, $t_y$, $t_w$ and $t_h$ denote the center coordinates, width and height predicted by the model; confidence denotes the trust level and prediction accuracy of the predicted box. Based on Equation (1), the center coordinates $(c_x, c_y)$, width $c_w$ and height $c_h$ of the predicted box can be calculated.


$$
\begin{cases}
c_x = \sigma(t_x) + b_x \\
c_y = \sigma(t_y) + b_y \\
c_w = p_w e^{t_w} \\
c_h = p_h e^{t_h}
\end{cases}
\tag{1}
$$


where $\sigma(x)$ is the logistic function; $t_x$, $t_y$, $t_w$ and $t_h$ denote the center coordinates, width and height predicted by the model; $p_w$ and $p_h$ are the width and height of the prior box relative to the feature map; $b_x$ and $b_y$ are the coordinates of the upper left corner of each grid cell in the feature map.
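Equation (1) can be checked with a short numeric sketch. The function and argument names below are illustrative only; all quantities are in feature-map units, as in the definitions above.

```python
import math


def sigmoid(x):
    """Logistic function used for the center offsets in Equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, b_x, b_y, p_w, p_h):
    """Decode raw predictions into a box (c_x, c_y, c_w, c_h) per Equation (1)."""
    c_x = sigmoid(t_x) + b_x    # center x: offset inside the grid cell + cell corner
    c_y = sigmoid(t_y) + b_y    # center y
    c_w = p_w * math.exp(t_w)   # width: prior width scaled exponentially
    c_h = p_h * math.exp(t_h)   # height: prior height scaled exponentially
    return c_x, c_y, c_w, c_h


# With all raw outputs at zero, the box sits at the cell center and keeps the prior size.
print(decode_box(0.0, 0.0, 0.0, 0.0, b_x=3, b_y=4, p_w=1.5, p_h=2.0))
```

The logistic function keeps the predicted center inside the responsible grid cell, while the exponential keeps the predicted width and height positive.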
3.2 Structure of YOLOv4 model 


The network structure of the YOLOv4 model is shown in Fig. 2. The backbone of YOLOv4 is CSPDarknet53, which integrates 5 CSP modules into the Darknet-53 model. Specifically, CSPDarknet53 includes 29 convolutional layers with a kernel size of 3x3 and has a receptive field of 725x725. In total, it has 27.6 M parameters. Benefiting from the advantages of CSPNet in reducing computational cost and memory consumption while maintaining high accuracy and keeping the model lightweight, YOLOv4 adds CSP to each large residual block of Darknet-53. Meanwhile, the feature map of the base layer is separated into two parts and then merged through the cross-stage hierarchical structure to guarantee accuracy while reducing the amount of calculation.

The current object detector is mainly composed of 4 modules, namely Input, Backbone, Neck, and Head. The Input of YOLOv4 employs Mosaic data augmentation.

The neck of YOLOv4 consists of spatial pyramid pooling and the path aggregation network (PANet). Specifically, the spatial pyramid pooling block is added over the CSPDarknet53 backbone. Spatial Pyramid Pooling (SPP) can markedly increase the receptive field and extract the most important context features without reducing the operating speed of the network. The maximum pooling sizes of spatial pyramid pooling are 5x5, 9x9, and 13x13. Moreover, the PANet is utilized to aggregate the parameters from different backbone levels.

In order to detect foxtail millet ears of different sizes, the idea of the anchor box was introduced. An anchor gives an initial value of the target width and height, which is utilized to make a rough judgment on the size of the target, so that the model does not blindly learn the target position and scale during training. Since foxtail millet ears are smaller than general objects, the K-means algorithm was employed to adaptively obtain the anchor boxes based on the foxtail millet ear detection dataset. Regressing offsets relative to these adapted anchor boxes allows the target width and height to be predicted more accurately.

Fig. 2 The architecture of YOLOv4

Specifically, the YOLOv4 model utilizes the K-means algorithm to generate a total of 9 anchor boxes, with sizes of (5, 7), (6, 12), (9, 8), (7, 18), (10, 13), (13, 10), (10, 21), (14, 15) and (17, 25), where each pair (X, Y) represents the width and height of the anchor box, respectively. The first 3 anchor boxes were used to detect smaller foxtail millet ears, the middle 3 anchor boxes were employed to detect medium-sized ears, and the last 3 anchor boxes were suitable for detecting larger ears.
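A minimal sketch of how anchor boxes can be obtained with K-means from the labeled box sizes is given below. The paper does not state the distance measure or the initialization; this sketch assumes the IOU-based assignment popularized by YOLOv2 (each box joins the anchor it overlaps most) and a seeded random initialization. The function names and the sample box sizes are hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IOU of two (width, height) pairs aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # assumed initialization: k random boxes
    for _ in range(iterations):
        # Assignment step: each box joins the anchor with the highest IOU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        # Update step: each anchor becomes the mean size of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep the anchor of an empty cluster
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:
            break  # converged
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical ear sizes (width, height) at network input scale.
example_boxes = [(5, 7), (6, 8), (5, 8), (10, 13), (11, 12),
                 (10, 14), (17, 25), (18, 24), (16, 26)]
print(kmeans_anchors(example_boxes, k=3))
```

In practice `boxes` would hold the sizes of all labeled ears and `k` would be 9, with the sorted result split into three groups for small, medium and large ears.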
3.3 The loss function 


YOLOv4 is trained with a loss function consisting of 3 terms, namely localization loss, confidence loss, and classification loss.

The confidence loss measures the confidence of the prediction box on the detected object. The confidence loss is formulated as:

$$
L_{conf} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{F}_i^j \log(F_i^j) + (1 - \hat{F}_i^j) \log(1 - F_i^j) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{F}_i^j \log(F_i^j) + (1 - \hat{F}_i^j) \log(1 - F_i^j) \right]
\tag{2}
$$

where $S^2$ and $B$ indicate the scale of the feature map and the prior boxes. $\lambda_{noobj}$ is a hyperparameter, which is utilized to balance the two terms. $\hat{F}_i^j$ and $F_i^j$ indicate the confidences of the annotated and predicted boxes. $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ are indicators. If there is a target at the jth prior box of the ith grid, $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ take values 1 and 0, respectively. Otherwise, $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ take values 0 and 1, respectively.
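The structure of the confidence loss in Equation (2) can be illustrated with a small sketch: a binary cross-entropy term for prior boxes that contain a target, plus a down-weighted term for those that do not. The value `lambda_noobj=0.5` and the input layout are assumptions for illustration; the paper does not report them.

```python
import math


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy for one confidence value (clamped for stability)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def confidence_loss(cells, lambda_noobj=0.5):
    """Confidence loss over a list of prior boxes.

    cells: list of (has_object, annotated_confidence, predicted_confidence).
    lambda_noobj: assumed weight of the no-object term (not given in the paper).
    """
    obj_term = sum(bce(t, p) for has_obj, t, p in cells if has_obj)
    noobj_term = sum(bce(t, p) for has_obj, t, p in cells if not has_obj)
    return obj_term + lambda_noobj * noobj_term


# One prior box with an ear (predicted 0.9) and one background box (predicted 0.2).
example_cells = [(True, 1.0, 0.9), (False, 0.0, 0.2)]
print(confidence_loss(example_cells))
```

Down-weighting the no-object term matters because most prior boxes in an image contain no ear and would otherwise dominate the loss.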

The localization loss indicates the error between the ground truth box and the predicted bounding box, and is computed only for the prior box responsible for detecting the target. The localization loss of YOLOv4 (Complete Intersection over Union loss, CIOU loss) is expressed as:




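$$
L_{ciou} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ 1 - IOU + \frac{d^2(c, c^*)}{l^2} + \frac{\frac{16}{\pi^4} \left( \arctan\frac{w^*}{h^*} - \arctan\frac{w}{h} \right)^4}{1 - IOU + \frac{4}{\pi^2} \left( \arctan\frac{w^*}{h^*} - \arctan\frac{w}{h} \right)^2} \right]
\tag{3}
$$

where IOU is defined as:

$$
IOU = \frac{C \cap D}{C \cup D}
\tag{4}
$$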

where $C$ and $D$ denote the ground truth and predicted bounding boxes; IOU is the ratio of the intersection to the union of these two boxes and is an indicator of the accuracy of the predicted box: the larger the IOU, the more accurate the position of the predicted box. $d(\cdot)$ is the Euclidean distance; $l$ is the diagonal length of the closure area that contains both the predicted box and the ground truth box; $c$, $w$ and $h$ denote the center coordinates, width and height of the predicted box; $c^*$, $w^*$ and $h^*$ denote the center coordinates, width and height of the ground truth box.
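For a single pair of boxes, the terms of Equations (3) and (4) can be computed as in the following sketch. Boxes are assumed to be given as (xmin, ymin, xmax, ymax) with positive width and height; the function names are illustrative.

```python
import math


def iou_xyxy(a, b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax), Equation (4)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, truth):
    """CIOU loss for one predicted box and one ground truth box, per Equation (3)."""
    iou = iou_xyxy(pred, truth)
    # Squared Euclidean distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    tcx, tcy = (truth[0] + truth[2]) / 2.0, (truth[1] + truth[3]) / 2.0
    center_dist_sq = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(pred[2], truth[2]) - min(pred[0], truth[0])
    enc_h = max(pred[3], truth[3]) - min(pred[1], truth[1])
    diag_sq = enc_w ** 2 + enc_h ** 2
    # Aspect-ratio consistency term v; the last term of Equation (3) is v^2 / (1 - IOU + v).
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = truth[2] - truth[0], truth[3] - truth[1]
    v = (4.0 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    denom = (1.0 - iou) + v
    aspect_term = v * v / denom if denom > 0 else 0.0
    return 1.0 - iou + center_dist_sq / diag_sq + aspect_term


print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike a plain IOU loss, the center-distance and aspect-ratio terms still provide a useful signal when the two boxes barely overlap, which is common for small, densely packed ears.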

The classification loss is represented as:

$$
L_{cls} = -\sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in cls} \left[ \hat{P}_i^j \log(P_i^j) + (1 - \hat{P}_i^j) \log(1 - P_i^j) \right]
\tag{5}
$$

where $L_{cls}$ is the classification loss, which is utilized to identify whether the object in the box is the target object (foxtail millet ear). $\hat{P}_i^j$ and $P_i^j$ denote the class probabilities of the annotated and predicted boxes.

Thus, the whole loss of YOLOv4 is:

$$
L = L_{ciou} + L_{conf} + L_{cls}
\tag{6}
$$


3.4 Experiments setup 


The hardware configuration for the experiments was a GTX TITAN Xp graphics card with 12 GB of memory and an i7-7800X processor. The software configuration was as follows: CUDA 10.1, cuDNN 7.6.4 and Python 3.6.9. The experiments were implemented in PyTorch.

The parameters for the experiments were set as follows: the learning rate was 0.001, the number of training iterations was 12,000, and the momentum was 0.949.
3.5 Evaluation criteria 


To validate the performance of the model, several important evaluation criteria were employed, including Precision, Recall, F1-score, and mean Average Precision (mAP).

Among them, Precision evaluates the accuracy of the model predictions; Recall measures how completely the model finds the targets; F1-score is a useful measure for classification tasks, defined as the harmonic mean of Precision and Recall. The maximum and minimum of F1-score are 1 and 0, respectively. mAP refers to the average detection accuracy of the model. The mAP is calculated according to the definition of Pattern Analysis, Statistical Modelling and Computational Learning, Visual Object Classes 2007 (PASCAL VOC2007), and a detection is counted as correct when the IOU between the detection box and the manually labeled box exceeds a certain threshold and the category prediction confidence score exceeds a certain value.

The definitions of Precision, Recall, F1-score, and mAP are shown in Equations (7) to (10).

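$$
Precision = \frac{TP}{TP + FP}
\tag{7}
$$

$$
Recall = \frac{TP}{TP + FN}
\tag{8}
$$

$$
F1\text{-}score = 2 \times \frac{P \times R}{P + R}
\tag{9}
$$

$$
mAP = \int_0^1 P(R) \, dR
\tag{10}
$$

where TP, FP and FN indicate the numbers of true positive, false positive and false negative samples; $P$ and $R$ denote Precision and Recall, and $P(R)$ is Precision as a function of Recall.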
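Equations (7) to (10) can be sketched as follows. For the average precision, this example assumes the 11-point interpolation of the PASCAL VOC2007 protocol, which approximates the integral in Equation (10); with a single class (foxtail millet ear), mAP equals this average precision. The function names and sample counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F1-score from detection counts, Equations (7) to (9)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from_pr(precision, recall)
    return precision, recall, f1


def f1_from_pr(precision, recall):
    """Harmonic mean of Precision and Recall, Equation (9)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def voc07_average_precision(recalls, precisions):
    """11-point interpolated average precision (assumed VOC2007-style protocol).

    recalls, precisions: points of the precision-recall curve, obtained by
    sweeping the confidence threshold over the ranked detections.
    """
    total = 0.0
    for i in range(11):
        threshold = i / 10.0
        # Highest precision among curve points whose recall reaches the threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


# Hypothetical counts: 80 correct detections, 20 false alarms, 20 missed ears.
print(precision_recall_f1(tp=80, fp=20, fn=20))
# F1-score from the Precision and Recall reported for YOLOv4 in the abstract.
print(f1_from_pr(0.87, 0.79))
```

Applying `f1_from_pr` to the reported Precision of 87% and Recall of 79% gives roughly 0.83, consistent with the reported F1-score of 83.00%.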
4 Results and analysis


4.1 Evaluation of three models 


To verify the effectiveness of YOLOv4 on foxtail millet ear detection, three models, Faster-RCNN, YOLOv2 and YOLOv3, were compared with it. Table 1 shows the comparison results of Precision, Recall, F1-score, and mAP. All models used the same training parameters. During testing, the confidence and IOU thresholds were set to 0.35 and 0.5, respectively. That is, a prediction is considered correct when its confidence score is greater than 0.35 and its IOU with the ground truth is greater than 0.5.

The results in Table 1 reveal that YOLOv4 obtained better results than Faster-RCNN, YOLOv2 and YOLOv3 on all four criteria of Precision, Recall, F1-score, and mAP. Specifically, in relative terms, the Precision of YOLOv4 exceeds that of Faster-RCNN, YOLOv2 and YOLOv3 by 1.9%, 13% and 1%; the Recall by 4.4%, 8.2% and 2.6%; and the F1-score by 2.6%, 10.6% and 2.4%, respectively. Furthermore, YOLOv4 acquires 3.9%, 10.4% and 2.6% better mAP than Faster-RCNN, YOLOv2 and YOLOv3.

Table 1 Comparison results of different models with parameters score = 0.35 and IOU = 0.5

Models        Precision/%   Recall/%   F1-score/%   mAP/%
Faster-RCNN   85.34         75.66      80.85        76.00
YOLOv2        77.00         73.00      75.00        71.52
YOLOv3        86.00         77.00      81.00        76.96
YOLOv4        87.00         79.00      83.00        78.99

Moreover, Fig. 3 shows the comparison results of different models under different numbers of iterations. In Fig. 3, the curves of mAP, Precision, Recall and F1-score of YOLOv4 are all above those of YOLOv2 and YOLOv3, which indicates that the performance of YOLOv4 is superior to YOLOv2 and YOLOv3 on all four criteria. Clearly, the results validate the effectiveness of YOLOv4 throughout the whole training phase.
Fig. 3 Comparison curves of different models under different iteration times: (b) Precision curves comparison of different models; (c) Recall curves comparison of different models
Besides, the qualitative comparison results of YOLOv2, YOLOv3 and YOLOv4 are shown in Fig. 4. YOLOv4 obtains more accurate predicted boxes than YOLOv2 and YOLOv3 in cases of occlusion and in other situations.


Fig. 4 Qualitative comparison results of YOLOv2, YOLOv3 and YOLOv4 models: (a) annotated images of ground truth; (c) qualitative results of YOLOv3; (d) qualitative results of YOLOv2


There are several reasons why YOLOv4 obtained better results.

(1) YOLOv4 integrates Mosaic data augmentation. Specifically, 4 training images are mixed by Mosaic, and thus 4 different contexts are merged. This permits objects outside of their normal context to be detected, which enhances the detection performance. Further, batch normalization calculates activation statistics from 4 distinct images on each layer, which greatly reduces the need for a large mini-batch size in the training phase.
(2) The backbone of YOLOv4 leverages CSPDarknet53, the Mish activation function, and the DropBlock strategy. Specifically, CSPDarknet53 adds a Cross Stage Partial structure to each group of blocks, which significantly improves the performance of the model. The Mish activation is smooth, which allows better information to be passed into the deep neural network and leads to better accuracy and generalization. The DropBlock strategy allows the number of dropped units to be gradually increased during the training process, thereby improving the accuracy of training and the robustness to hyperparameter selection.

(3) YOLOv4 employs Complete Intersection over Union (CIOU) loss and Distance Intersection over Union-Non Maximum Suppression (DIOU-NMS), which further improve the convergence speed and regression accuracy.
4.2 Evaluation of different IOU thresholds


Different from common object detection, the foxtail millet ears in the images in this research were small and densely distributed. In order to find the most suitable IOU threshold for the foxtail millet ear detection dataset, the performance of the models with different IOU thresholds was explored. Specifically, the confidence score was fixed at 0.35, and the IOU threshold was set to 0.2, 0.35, 0.5, and 0.65 on the test set, respectively. The comparison results are shown in Table 2.

With the increase of the IOU threshold, the evaluation criteria changed markedly. That is, all the evaluation criteria of the three models showed a downward trend as the IOU threshold increased. Although the evaluation criteria were higher when the IOU threshold was 0.2 and 0.35, the required overlap between the predicted box and the ground-truth box was too small, and thus the prediction of foxtail millet ears was not convincing. On the other hand, when the IOU threshold was 0.65, the performance of the models dropped sharply by 20% to 30%, which means that many ears could not be detected. Therefore, as a balance, the IOU threshold was chosen as 0.5, which gives both a reasonable overlap requirement and good detection performance.

Table 2 The impact of IOU values on the performance of different models

Model    IOU    Precision/%   Recall/%   F1-score/%   mAP/%
YOLOv2   0.20   85.00         81.00      83.00        84.05
YOLOv2   0.35   83.00         80.00      81.00        81.03
YOLOv2   0.50   77.00         73.00      75.00        71.52
YOLOv2   0.65   53.00         51.00      52.00        40.64
YOLOv3   0.20   91.00         81.00      86.00        84.05
YOLOv3   0.35   90.00         81.00      85.00        82.36
YOLOv3   0.50   86.00         77.00      81.00        76.96
YOLOv3   0.65   65.00         59.00      62.00        48.55
YOLOv4   0.20   92.00         84.00      87.00        85.01
YOLOv4   0.35   91.00         83.00      87.00        83.69
YOLOv4   0.50   87.00         79.00      83.00        78.99
YOLOv4   0.65   70.00         64.00      67.00        56.38


4.3 Evaluation of the foxtail millet ear detection with/without anchor boxes adjustment


To evaluate foxtail millet ear detection with and without anchor box adjustment, two pairs of models were compared: YOLOv3 and YOLOv3_adj, and YOLOv4 and YOLOv4_adj. Here, YOLOv3_adj and YOLOv4_adj denote the corresponding models with anchor boxes adaptively adjusted by the K-means algorithm. Specifically, the anchor boxes for YOLOv3 were (3, 5), (4, 8), (6, 5), (5, 12), (7, 9), (9, 7), (7, 14), (10, 10), (11, 17), and those for YOLOv4 were (5, 7), (6, 12), (9, 8), (7, 18), (10, 13), (13, 10), (10, 21), (14, 15) and (17, 25). The comparison results are shown in Table 3.

Table 3 shows that YOLOv3_adj and YOLOv4_adj are superior to YOLOv3 and YOLOv4 respectively, which verifies the effectiveness of the adaptively adjusted anchor boxes. For example, the mAP of YOLOv4 rises from 78.99% to 80.87% after the adjustment. The reason is that the adjusted anchor boxes fit the foxtail millet ears more closely, which allows the model to predict the width and height of the objects more accurately.

Table 3 Comparison results of models with/without anchor boxes adjustment

Model        Precision/%   Recall/%   F1-score/%   mAP/%
YOLOv3       86.00         77.00      81.00        76.90
YOLOv3_adj   87.00         78.00      81.00        77.16
YOLOv4       87.00         79.00      83.00        78.99
YOLOv4_adj   87.00         80.00      83.00        80.87


4.4 Evaluation of the changing reasons of model criteria


Equations (7) to (10) show that the TP and FP values are directly related to model performance. To explore the underlying reasons for the differences between the predictions of the models, the TP and FP values of different models on the test set were analyzed. Obtaining the TP and FP values required a two-step filtering operation on the predicted boxes: 1) prediction boxes below a certain confidence threshold (such as 0.5) were removed, and 2) the remaining prediction boxes were sorted in descending order of confidence, and the IOU between the ground truth box and the predicted box with the highest confidence was calculated. If the IOU exceeded the set threshold (the IOU threshold was set to 0.35), the current predicted box was treated as a true positive sample and added to TP. Meanwhile, the corresponding foxtail millet ear was marked as detected, and all subsequent predicted boxes for this ear were treated as FP. The final statistical results are presented in Table 4.
Table 4 TP and FP values predicted by the experimental models for the ear target

Model    TP     FP    TP increment   FP increment   mAP increment
YOLOv2   2052   623   0              0              0
YOLOv3   2157   352   105            -271           5.44
YOLOv4   2220   329   168            -294           7.47
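The two-step filtering used to obtain the TP and FP counts can be sketched as follows for a single image. This is an illustration under stated assumptions rather than the authors' implementation: boxes are (xmin, ymin, xmax, ymax), each prediction is compared with the ground truth box it overlaps most, and the function names and sample data are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp(predictions, ground_truths, conf_threshold, iou_threshold):
    """Count true and false positives for one image.

    predictions: list of (confidence, box); ground_truths: list of boxes.
    """
    # Step 1: drop predictions below the confidence threshold.
    kept = [p for p in predictions if p[0] >= conf_threshold]
    # Step 2: process the remaining predictions from most to least confident.
    kept.sort(key=lambda p: p[0], reverse=True)
    detected = [False] * len(ground_truths)
    tp = fp = 0
    for _, box in kept:
        best_iou, best_idx = 0.0, -1
        for idx, truth in enumerate(ground_truths):
            overlap = box_iou(box, truth)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou > iou_threshold and not detected[best_idx]:
            detected[best_idx] = True  # mark this ear as detected
            tp += 1
        else:
            fp += 1  # poor overlap, or a repeated detection of the same ear
    return tp, fp


truths = [(0, 0, 10, 10)]
preds = [(0.9, (0, 0, 10, 10)),    # correct detection
         (0.8, (1, 1, 10, 10)),    # duplicate of the same ear
         (0.2, (0, 0, 10, 10)),    # removed by the confidence filter
         (0.7, (50, 50, 60, 60))]  # no matching ear
print(count_tp_fp(preds, truths, conf_threshold=0.35, iou_threshold=0.5))
```

Summing the per-image counts over the test set would give totals of the kind reported in Table 4.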


The test results indicated that the higher the TP value, the better the model performs. As can be seen from Table 4, the YOLOv4 model is better than the other models, and YOLOv3 is superior to YOLOv2. Specifically, the TP value of the YOLOv4 model is 63 higher than that of YOLOv3 (increased by 2.92%), the FP value is 23 smaller than that of YOLOv3 (decreased by 6.53%), and the mAP increased by 2.63%. Furthermore, the TP value of the YOLOv4 model is 168 higher than that of YOLOv2 (increased by 8.19%), the FP value is 294 smaller than that of YOLOv2 (decreased by 47.19%), and the mAP increased by 10.44%. Moreover, the TP value of the YOLOv3 model is 105 higher than that of YOLOv2 (increased by 5.12%), the FP value is 271 smaller than that of YOLOv2 (decreased by 43.50%), and the mAP increased by 7.60%. From the above description, it can be concluded that the change ratio of mAP is similar to that of TP. Better detection results require a higher number of TP.
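The percentages quoted for TP and FP are relative changes with respect to the weaker model, which can be verified directly from the counts in Table 4. The helper name below is illustrative.

```python
def relative_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (TP, FP) counts taken from Table 4.
table4 = {"YOLOv2": (2052, 623), "YOLOv3": (2157, 352), "YOLOv4": (2220, 329)}

for better, worse in [("YOLOv4", "YOLOv3"), ("YOLOv4", "YOLOv2"), ("YOLOv3", "YOLOv2")]:
    tp_change = relative_change(table4[better][0], table4[worse][0])
    fp_change = relative_change(table4[better][1], table4[worse][1])
    print(f"{better} vs {worse}: TP {tp_change:+.2f}%, FP {fp_change:+.2f}%")
```

For instance, the TP gain of YOLOv4 over YOLOv3 is 63 / 2157, or about 2.92%, and the FP reduction is 23 / 352, or about 6.53%.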


4.5 Evaluation of the foxtail millet ear detection with different input original image size


The collected original foxtail millet ear images had a size of 4864x3648 px, whereas the input dimension of the YOLOv4 model was 608x608 px. This large resize ratio made the foxtail millet ears very small after resizing. Consequently, detection with a smaller original image size, obtained by cropping the original image to remove the unnecessary parts, was studied. Specifically, the size of the cropped foxtail millet ear images was 2000x1500 px. The comparison results are shown in Table 5.

Table 5 The impact of different original image size on the foxtail millet ear detection

Original image size/px   YOLOv4 input size/px   Precision/%   Recall/%   F1-score/%   mAP/%
4864x3648                608x608                87            79.00      83.00        78.99
2000x1500                608x608                92            84.00      88.00        83.50

Table 5 shows that the cropped foxtail millet ear images obtained better results than the images without cropping. This is because cropping away the unnecessary parts reduces the resize ratio applied before the image is fed into the detection model. Thus, enhanced foxtail millet ear detection performance could be achieved.
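The effect of cropping on the resize ratio can be worked out directly from the image sizes in Table 5. The sketch assumes the image is resized straight to the network input size without letterbox padding, which the paper does not specify; the 100 px ear width is a hypothetical value used only for illustration.

```python
def resize_ratios(original_size, input_size):
    """Horizontal and vertical downscaling factors from original to network input."""
    (orig_w, orig_h), (in_w, in_h) = original_size, input_size
    return orig_w / in_w, orig_h / in_h


full = resize_ratios((4864, 3648), (608, 608))
cropped = resize_ratios((2000, 1500), (608, 608))
print("full image:", full)
print("cropped image:", cropped)

# A hypothetical ear that is 100 px wide in the original image.
ear_width_px = 100
print("ear width after resizing the full image:", ear_width_px / full[0])
print("ear width after resizing the cropped image:", ear_width_px / cropped[0])
```

Under these assumptions the full image is shrunk 8 times horizontally and 6 times vertically, while the cropped image is shrunk only about 3.3 and 2.5 times, so each ear covers more pixels at the network input.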


5 Conclusions 


In this research, a foxtail millet ear detection approach based on YOLOv4 with adaptive anchor adjustment was proposed, which obtained promising detection results. Firstly, a novel and relatively large foxtail millet ear detection dataset collected from farmland was established, which contained 784 images and 10,000 ear samples in total. Then, the YOLOv4 detection model with the proposed adaptive anchor adjustment was applied to perform the task of foxtail millet ear detection. Extensive experiments were performed to validate the usability of the established dataset and the effectiveness of YOLOv4. Experimental results revealed that YOLOv4 obtained better detection performance for foxtail millet ear detection than the other models (YOLOv2, YOLOv3) in terms of all evaluation criteria.
The detection of other millet ear categories elsewhere in the world has not been explored. Furthermore, the foxtail millet ear detection dataset in this research was medium-sized and should be enlarged in the future. More effective approaches for detecting foxtail millet ears will also be explored.